Least Squares Temporal Difference Methods: An Analysis under General Conditions

نویسنده

  • Huizhen Yu
چکیده

We consider approximate policy evaluation for finite state and action Markov decision processes (MDP) with the least squares temporal difference (LSTD) algorithm, LSTD(λ), in an exploration-enhanced learning context, where policy costs are computed from observations of a Markov chain different from the one corresponding to the policy under evaluation. We establish for the discounted cost criterion that LSTD(λ) converges almost surely under mild, minimal conditions. We also analyze other properties of the iterates involved in the algorithm, including convergence in mean and boundedness. Our analysis draws on theories of both finite space Markov chains and weak Feller Markov chains on a topological space. Our results can be applied to other temporal difference algorithms and MDP models. As examples, we give a convergence analysis of a TD(λ) algorithm and extensions to MDP with compact state and action spaces, as well as a convergence proof of a new LSTD algorithm with state-dependent λ-parameters.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Convergence of Least Squares Temporal Difference Methods Under General Conditions

We consider approximate policy evaluation for finite state and action Markov decision processes (MDP) in the off-policy learning context and with the simulation-based least squares temporal difference algorithm, LSTD(λ). We establish for the discounted cost criterion that the off-policy LSTD(λ) converges almost surely under mild, minimal conditions. We also analyze other convergence and bounded...

متن کامل

Sustainable ℓ2-regularized actor-critic based on recursive least-squares temporal difference learning

Least-squares temporal difference learning (LSTD) has been used mainly for improving the data efficiency of the critic in actor-critic (AC). However, convergence analysis of the resulted algorithms is difficult when policy is changing. In this paper, a new AC method is proposed based on LSTD under discount criterion. The method comprises two components as the contribution: (1) LSTD works in an ...

متن کامل

Minimal Positive Stencils in Meshfree Finite Difference Methods for the Poisson Equation

Meshfree finite difference methods for the Poisson equation approximate the Laplace operator on a point cloud. Desirable are positive stencils, i.e. all neighbor entries are of the same sign. Classical least squares approaches yield large stencils that are in general not positive. We present an approach that yields stencils of minimal size, which are positive. We provide conditions on the point...

متن کامل

Regularized Policy Iteration

In this paper we consider approximate policy-iteration-based reinforcement learning algorithms. In order to implement a flexible function approximation scheme we propose the use of non-parametric methods with regularization, providing a convenient way to control the complexity of the function approximator. We propose two novel regularized policy iteration algorithms by addingL-regularization to...

متن کامل

Convergence Results for Some Temporal Difference Methods Based on Least Squares Citation

We consider finite-state Markov decision processes, and prove convergence and rate of convergence results for certain least squares policy evaluation algorithms of the type known as LSPE( ). These are temporal difference methods for constructing a linear function approximation of the cost function of a stationary policy, within the context of infinite-horizon discounted and average cost dynamic...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • SIAM J. Control and Optimization

دوره 50  شماره 

صفحات  -

تاریخ انتشار 2012